
    RSG: Beating Subgradient Method without Smoothness and Strong Convexity

    In this paper, we study the efficiency of a Restarted SubGradient (RSG) method that periodically restarts the standard subgradient method (SG). We show that, when applied to a broad class of convex optimization problems, the RSG method can find an $\epsilon$-optimal solution with a lower complexity than the SG method. In particular, we first show that RSG can reduce the dependence of SG's iteration complexity on the distance between the initial solution and the optimal set to the distance between the $\epsilon$-level set and the optimal set, multiplied by a logarithmic factor. Moreover, we show the advantages of RSG over SG in solving three different families of convex optimization problems. (a) For problems whose epigraph is a polyhedron, RSG is shown to converge linearly. (b) For problems with a local quadratic growth property in the $\epsilon$-sublevel set, RSG has an $O(\frac{1}{\epsilon}\log(\frac{1}{\epsilon}))$ iteration complexity. (c) For problems that admit a local Kurdyka-Łojasiewicz property with a power constant of $\beta\in[0,1)$, RSG has an $O(\frac{1}{\epsilon^{2\beta}}\log(\frac{1}{\epsilon}))$ iteration complexity. The novelty of our analysis lies in exploiting the lower bound of the first-order optimality residual at the $\epsilon$-level set. It is this novelty that allows us to explore the local properties of functions (e.g., the local quadratic growth property, the local Kurdyka-Łojasiewicz property, and more generally local error bound conditions) to develop the improved convergence of RSG. We also develop a practical variant of RSG enjoying faster convergence than the SG method, which can be run without knowing the parameters involved in the local error bound condition. We demonstrate the effectiveness of the proposed algorithms on several machine learning tasks including regression, classification and matrix completion.
    Comment: Final version accepted by JML
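
    The restart scheme can be pictured with a short sketch. The numpy code below is only a minimal illustration of the restart idea under an assumed stage length and a halving step-size schedule, not the paper's exact algorithm or parameter choices.

        import numpy as np

        def rsg(subgrad, x0, t_inner=500, n_stages=8, eta0=1.0):
            """Minimal sketch of a restarted subgradient (RSG) scheme.

            subgrad(x) returns a subgradient of the objective at x.  Each stage
            runs the standard subgradient method for t_inner steps from the
            previous stage's averaged iterate; the step size is halved between
            stages (an illustrative schedule, not the paper's prescription)."""
            x = x0.astype(float)
            eta = eta0
            for _ in range(n_stages):
                avg = np.zeros_like(x)
                for _ in range(t_inner):
                    x = x - eta * subgrad(x)   # plain subgradient step
                    avg += x
                x = avg / t_inner              # restart from the averaged iterate
                eta *= 0.5                     # shrink the step size stage-wise
            return x

        # toy usage on the non-smooth objective f(x) = ||x - 1||_1
        subgrad = lambda x: np.sign(x - 1.0)
        x_hat = rsg(subgrad, x0=np.zeros(5))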

    Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

    Recently, stochastic momentum methods have been widely adopted in training deep neural networks. However, their convergence analysis is still underexplored, in particular for non-convex optimization. This paper fills the gap between practice and theory by developing a basic convergence analysis of two stochastic momentum methods, namely the stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method. We hope that the basic convergence results developed in this paper can serve as a reference for the convergence of stochastic momentum methods and also as baselines for comparison in future development of stochastic momentum methods. The novelty of the convergence analysis presented in this paper is a unified framework that reveals more insights about the similarities and differences between different stochastic momentum methods and the stochastic gradient method. The unified framework exhibits a continuous change, governed by a free parameter, from the gradient method to Nesterov's accelerated gradient method and finally the heavy-ball method, which can help explain a similar change observed in the testing error convergence behavior for deep learning. Furthermore, our empirical results for optimizing deep neural networks demonstrate that the stochastic variant of Nesterov's accelerated gradient method achieves a good tradeoff (between speed of convergence in training error and robustness of convergence in testing error) among the three stochastic methods.
    Comment: Added some references and more empirical results
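
    For concreteness, the two momentum updates analyzed above can be written side by side. The sketch below shows the standard stochastic heavy-ball and stochastic Nesterov updates (plain SGD is the special case beta = 0); the paper's unified framework additionally interpolates between them with a free parameter, which is not reproduced here.

        import numpy as np

        def shb_step(x, x_prev, stoch_grad, alpha=0.01, beta=0.9):
            # stochastic heavy-ball: gradient step at x plus momentum from the last move
            return x - alpha * stoch_grad(x) + beta * (x - x_prev)

        def snag_step(x, x_prev, stoch_grad, alpha=0.01, beta=0.9):
            # stochastic Nesterov: evaluate the stochastic gradient at an extrapolated point
            y = x + beta * (x - x_prev)
            return y - alpha * stoch_grad(y)

        # toy usage: minimize E[(x - 1)^2] with noisy gradients
        rng = np.random.default_rng(0)
        stoch_grad = lambda x: 2.0 * (x - 1.0) + rng.normal(scale=0.1, size=x.shape)
        x_prev, x = np.zeros(3), np.zeros(3)
        for _ in range(1000):
            x_prev, x = x, snag_step(x, x_prev, stoch_grad)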

    Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition

    In this paper, a new theory is developed for first-order stochastic convex optimization, showing that the global convergence rate is sufficiently quantified by a local growth rate of the objective function in a neighborhood of the optimal solutions. In particular, if the objective function $F(\mathbf{w})$ in the $\epsilon$-sublevel set grows as fast as $\|\mathbf{w} - \mathbf{w}_*\|_2^{1/\theta}$, where $\mathbf{w}_*$ represents the closest optimal solution to $\mathbf{w}$ and $\theta\in(0,1]$ quantifies the local growth rate, the iteration complexity of first-order stochastic optimization for achieving an $\epsilon$-optimal solution can be $\widetilde{O}(1/\epsilon^{2(1-\theta)})$, which is optimal at most up to a logarithmic factor. To achieve this faster global convergence, we develop two different accelerated stochastic subgradient methods by iteratively solving the original problem approximately in a local region around a historical solution, with the size of the local region gradually decreasing as the solution approaches the optimal set. Besides the theoretical improvements, this work also includes new contributions towards making the proposed algorithms practical: (i) we present practical variants of the accelerated stochastic subgradient methods that can run without knowledge of the multiplicative growth constant and even of the growth rate $\theta$; (ii) we consider a broad family of problems in machine learning to demonstrate that the proposed algorithms enjoy faster convergence than the traditional stochastic subgradient method. We also characterize the complexity of the proposed algorithms for ensuring that the gradient is small without the smoothness assumption.
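
    A minimal sketch of the stage-wise idea follows: each stage's stochastic subgradient updates are restricted to a shrinking ball around the previous stage's solution. The radii, step sizes and stage lengths below are illustrative assumptions rather than the paper's prescribed values.

        import numpy as np

        def assg(stoch_subgrad, x0, n_stages=8, t_inner=2000, eta0=0.1, r0=10.0):
            """Stage-wise stochastic subgradient sketch with shrinking local regions."""
            x_center = x0.astype(float)
            eta, r = eta0, r0
            for _ in range(n_stages):
                x = x_center.copy()
                avg = np.zeros_like(x)
                for _ in range(t_inner):
                    x = x - eta * stoch_subgrad(x)
                    # project back onto the ball ||x - x_center||_2 <= r
                    d = x - x_center
                    norm = np.linalg.norm(d)
                    if norm > r:
                        x = x_center + d * (r / norm)
                    avg += x
                x_center = avg / t_inner   # restart from the averaged iterate
                eta *= 0.5                 # halve the step size ...
                r *= 0.5                   # ... and the local region radius
            return x_center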

    Doubly Stochastic Primal-Dual Coordinate Method for Bilinear Saddle-Point Problem

    We propose a doubly stochastic primal-dual coordinate optimization algorithm for empirical risk minimization, which can be formulated as a bilinear saddle-point problem. In each iteration, our method randomly samples a block of coordinates of the primal and dual solutions to update. The linear convergence of our method can be established in terms of 1) the distance from the current iterate to the optimal solution and 2) the primal-dual objective gap. We show that the proposed method has a lower overall complexity than existing coordinate methods when either the data matrix has a factorized structure or the proximal mapping on each block is computationally expensive, e.g., involving an eigenvalue decomposition. The efficiency of the proposed method is confirmed by empirical studies on several real applications, such as the multi-task large margin nearest neighbor problem.
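
    To make the sampling pattern concrete, the sketch below applies doubly stochastic primal-dual coordinate updates (one random dual coordinate and one random primal coordinate per iteration) to the bilinear saddle-point form of ridge regression. It omits the extrapolation and block-sampling details of the actual method and is only an illustration.

        import numpy as np

        def doubly_stochastic_pd(A, b, lam=0.1, sigma=0.5, tau=0.5, n_iters=50000, seed=0):
            """Simplified doubly stochastic primal-dual coordinate sketch for
                min_x max_y  y^T (A x - b) - ||y||^2 / 2 + (lam/2) ||x||^2,
            the saddle-point form of ridge regression."""
            rng = np.random.default_rng(seed)
            n, d = A.shape
            x, y = np.zeros(d), np.zeros(n)
            for _ in range(n_iters):
                i = rng.integers(n)   # sampled dual coordinate (an example)
                j = rng.integers(d)   # sampled primal coordinate (a feature)
                # proximal ascent on y_i for  y_i (A_i x - b_i) - y_i^2 / 2
                y[i] = (y[i] + sigma * (A[i] @ x - b[i])) / (1.0 + sigma)
                # proximal descent on x_j for  (A[:, j] . y) x_j + (lam/2) x_j^2
                x[j] = (x[j] - tau * (A[:, j] @ y)) / (1.0 + tau * lam)
            return x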

    On Data Preconditioning for Regularized Loss Minimization

    In this work, we study data preconditioning, a well-known and long-existing technique, for boosting the convergence of first-order methods for regularized loss minimization. It is well understood that the condition number of the problem, i.e., the ratio of the Lipschitz constant to the strong convexity modulus, has a harsh effect on the convergence of first-order optimization methods. Therefore, minimizing a regularized loss with a small regularization parameter to achieve good generalization performance yields an ill-conditioned problem and becomes the bottleneck for big data problems. We provide a theory on data preconditioning for regularized loss minimization. In particular, our analysis exhibits an appropriate data preconditioner and characterizes the conditions on the loss function and on the data under which data preconditioning can reduce the condition number and therefore boost the convergence for minimizing the regularized loss. To make data preconditioning practically useful, we employ and analyze a random sampling approach to efficiently compute the preconditioned data. Preliminary experiments validate our theory.
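
    A toy numpy illustration of the condition-number effect follows: the second-moment matrix is estimated from a random subsample, used to whiten the data, and the condition numbers of the regularized Hessians before and after preconditioning are compared. The whitening preconditioner and sample size below are assumed choices for illustration, not necessarily the preconditioner analyzed in the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, lam = 5000, 50, 1e-3
        # ill-conditioned synthetic data: feature scales span two orders of magnitude
        X = rng.normal(size=(n, d)) @ np.diag(np.logspace(0, 2, d))

        idx = rng.choice(n, size=500, replace=False)       # random subsample
        C_hat = X[idx].T @ X[idx] / len(idx)               # sampled second-moment matrix
        evals, evecs = np.linalg.eigh(C_hat + lam * np.eye(d))
        P_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
        X_pre = X @ P_inv_sqrt                             # preconditioned (whitened) data

        cond_before = np.linalg.cond(X.T @ X / n + lam * np.eye(d))
        cond_after = np.linalg.cond(X_pre.T @ X_pre / n + lam * np.eye(d))
        print(cond_before, cond_after)                     # the latter is far smaller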

    Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than $O(1/\epsilon)$

    In this paper, we develop a novel homotopy smoothing (HOPS) algorithm for solving a family of non-smooth problems whose objective is composed of a non-smooth term with an explicit max-structure and a smooth term or a simple non-smooth term whose proximal mapping is easy to compute. The best known iteration complexity for solving such non-smooth optimization problems is $O(1/\epsilon)$ without any assumption of strong convexity. In this work, we show that the proposed HOPS achieves a lower iteration complexity of $\widetilde{O}(1/\epsilon^{1-\theta})$, where $\widetilde{O}(\cdot)$ suppresses a logarithmic factor and $\theta\in(0,1]$ captures the local sharpness of the objective function around the optimal solutions. To the best of our knowledge, this is the lowest iteration complexity achieved so far for the considered non-smooth optimization problems without the strong convexity assumption. The HOPS algorithm employs Nesterov's smoothing technique and Nesterov's accelerated gradient method and runs in stages, gradually decreasing the smoothing parameter in a stage-wise manner until it yields a sufficiently good approximation of the original function. We show that HOPS enjoys linear convergence for many well-known non-smooth problems (e.g., empirical risk minimization with a piecewise linear loss function and $\ell_1$ norm regularizer, finding a point in a polyhedron, cone programming, etc.). Experimental results verify the effectiveness of HOPS in comparison with Nesterov's smoothing algorithm and primal-dual style first-order methods.
    Comment: This is a long version of the paper accepted by NIPS 201
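
    The stage-wise smoothing idea can be sketched on a concrete instance, min_x ||Ax - b||_1, using Nesterov's smoothing of the absolute value. For brevity the sketch runs plain gradient descent inside each stage where HOPS would use Nesterov's accelerated gradient, and the halving schedule for the smoothing parameter is an illustrative assumption.

        import numpy as np

        def huber_grad(z, mu):
            # gradient of the Nesterov-smoothed absolute value (a Huber-like function)
            return np.clip(z / mu, -1.0, 1.0)

        def hops_l1_regression(A, b, mu0=1.0, n_stages=8, t_inner=500):
            """Homotopy-smoothing sketch for min_x ||Ax - b||_1: smooth with
            parameter mu, optimize for a stage, then halve mu and warm-start."""
            d = A.shape[1]
            x = np.zeros(d)
            mu = mu0
            L = np.linalg.norm(A, 2) ** 2      # squared spectral norm of A
            for _ in range(n_stages):
                step = mu / L                  # smoothed gradient is (L/mu)-Lipschitz
                for _ in range(t_inner):
                    g = A.T @ huber_grad(A @ x - b, mu)
                    x = x - step * g
                mu *= 0.5                      # tighten the smoothing, keep x as warm start
            return x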

    Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees

    In this paper, we study a fast approximation method for the large-scale high-dimensional sparse least-squares regression problem by exploiting Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we propose to apply a JL transform to the data matrix and the target vector and then to solve the sparse least-squares problem on the compressed data with a slightly larger regularization parameter. Theoretically, we establish the optimization error bound of the learned model for two different sparsity-inducing regularizers, i.e., the elastic net and the $\ell_1$ norm. Compared with previous relevant work, our analysis is non-asymptotic and exhibits more insights on the bound, the sample complexity and the regularization. As an illustration, we also provide an error bound for the Dantzig selector under JL transforms.
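
    A hedged sketch of the compression step: a Gaussian JL transform is applied to the data matrix and target, and the l1-regularized least-squares problem is solved on the compressed pair (here by a simple ISTA loop) with a mildly inflated regularization parameter. The choice of transform and the inflation factor are illustrative, not the values dictated by the theory.

        import numpy as np

        def ista_lasso(A, b, lam, n_iters=2000):
            # proximal gradient (ISTA) for  (1/2)||Ax - b||^2 + lam * ||x||_1
            L = np.linalg.norm(A, 2) ** 2
            x = np.zeros(A.shape[1])
            for _ in range(n_iters):
                z = x - (A.T @ (A @ x - b)) / L
                x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
            return x

        def compressed_sparse_lsq(A, b, lam, m, seed=0):
            """Solve the sparse least-squares problem on JL-compressed data."""
            rng = np.random.default_rng(seed)
            n = A.shape[0]
            S = rng.normal(size=(m, n)) / np.sqrt(m)     # Gaussian JL sketching matrix
            return ista_lasso(S @ A, S @ b, lam * 1.2)   # slightly larger regularizer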

    First-order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems

    In this paper, we consider first-order convergence theory and algorithms for solving a class of non-convex non-concave min-max saddle-point problems whose objective function is weakly convex in the variables of minimization and weakly concave in the variables of maximization. Such problems have many important applications in machine learning, including training Generative Adversarial Nets (GANs). We propose an algorithmic framework motivated by the inexact proximal point method, where the weakly monotone variational inequality (VI) corresponding to the original min-max problem is solved by approximately solving a sequence of strongly monotone VIs constructed by adding a strongly monotone mapping to the original gradient mapping. We prove that the generic algorithmic framework achieves first-order convergence to a nearly stationary solution of the original min-max problem and establish different rates by employing different algorithms for solving each strongly monotone VI. Experiments verify the convergence theory and also demonstrate the effectiveness of the proposed methods on training GANs.
    Comment: In this revised version, we changed the title to "First-order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems" and added more experimental results
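
    The outer-inner structure described above can be sketched as follows: each outer step adds a strongly monotone term centered at the current iterate, and the resulting strongly monotone problem is solved approximately, here by plain gradient descent-ascent as a simple stand-in for the inner solvers studied in the paper.

        import numpy as np

        def inexact_prox_point_minmax(grad_x, grad_y, x0, y0, gamma=1.0,
                                      n_outer=50, n_inner=200, eta=0.05):
            """Inexact proximal point sketch for min_x max_y f(x, y).

            At outer step k the regularized problem
                min_x max_y  f(x, y) + (gamma/2)||x - x_k||^2 - (gamma/2)||y - y_k||^2
            is approximately solved; gamma should exceed the weak-convexity/
            weak-concavity modulus so that each subproblem is strongly monotone."""
            x_k, y_k = x0.copy(), y0.copy()
            for _ in range(n_outer):
                x, y = x_k.copy(), y_k.copy()
                for _ in range(n_inner):
                    gx = grad_x(x, y) + gamma * (x - x_k)   # regularized descent direction
                    gy = grad_y(x, y) - gamma * (y - y_k)   # regularized ascent direction
                    x, y = x - eta * gx, y + eta * gy
                x_k, y_k = x, y                             # move the proximal center
            return x_k, y_k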

    Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning

    Min-max saddle-point problems have broad applications in many tasks in machine learning, e.g., distributionally robust learning, learning with non-decomposable loss, or learning with uncertain data. Although convex-concave saddle-point problems have been broadly studied with efficient algorithms and solid theories available, it remains a challenge to design provably efficient algorithms for non-convex saddle-point problems, especially when the objective function involves an expectation or a large-scale finite sum. Motivated by recent literature on non-convex non-smooth minimization, this paper studies a family of non-convex min-max problems where the minimization component is non-convex (weakly convex) and the maximization component is concave. We propose a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively. We establish the computational complexities of both methods for finding a nearly stationary point of the corresponding minimization problem.

    Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

    We study distributed optimization algorithms for minimizing the average of convex functions. Applications include empirical risk minimization problems in statistical machine learning where the datasets are large and have to be stored on different machines. We design a distributed stochastic variance reduced gradient algorithm that, under certain conditions on the condition number, simultaneously achieves the optimal parallel runtime, amount of communication and rounds of communication among all distributed first-order methods, up to constant factors. Our method and its accelerated extension also outperform existing distributed algorithms in terms of the rounds of communication as long as the condition number is not too large compared to the size of the data on each machine. We also prove a lower bound on the number of rounds of communication for a broad class of distributed first-order methods that includes the proposed algorithms. We show that our accelerated distributed stochastic variance reduced gradient algorithm achieves this lower bound, so it uses the fewest rounds of communication among all distributed first-order algorithms.
    Comment: significant additions to both theory and experimental results
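
    The communication pattern can be illustrated with a single-process simulation: each round, one averaging step over all machines' shards plays the role of a communication round, after which one machine performs SVRG-style variance-reduced updates on its local shard. The sketch assumes grad_fn(x, data) returns the average gradient over the given examples; it illustrates the general pattern rather than the paper's exact protocol.

        import numpy as np

        def distributed_svrg(shards, grad_fn, x0, n_rounds=20, n_inner=100, eta=0.1, seed=0):
            """Simulated distributed SVRG: one gradient-averaging communication
            round per outer iteration, followed by local variance-reduced updates."""
            rng = np.random.default_rng(seed)
            x = x0.copy()
            for r in range(n_rounds):
                # communication round: average local full gradients across machines
                full_grad = np.mean([grad_fn(x, shard) for shard in shards], axis=0)
                x_ref = x.copy()
                local = shards[r % len(shards)]    # machine doing local work this round
                for _ in range(n_inner):
                    i = rng.integers(len(local))
                    sample = local[i:i + 1]
                    # SVRG-style variance-reduced stochastic gradient
                    g = grad_fn(x, sample) - grad_fn(x_ref, sample) + full_grad
                    x = x - eta * g
            return x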